Contributors:

William Loving (wfl9zy), James Sweat (jes9hd)

Goals:

  1. Explore and visualize the broader Computer Science / Data Analysis industry fields.
  2. Discover interesting correlations between attributes of available jobs using multiple datasets.
  3. Learn how to develop meaningful visualizations that communicate our data to an uninformed audience.

Part 1: Data Science Positions:

Here we explore data science jobs in and around the United States.

Load Data:

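library(tidyverse)  # provides read_csv(), the dplyr verbs, and ggplot2
library(plotly)     # provides ggplotly() for the interactive plots

data <- read_csv("../data/data-science-jobs/ds_salaries.csv")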
## New names:
## • `` -> `...1`
## Rows: 607 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): experience_level, employment_type, job_title, salary_currency, empl...
## dbl (5): ...1, work_year, salary, salary_in_usd, remote_ratio
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 12
##    ...1 work_year experience_level employment_type job_title              salary
##   <dbl>     <dbl> <chr>            <chr>           <chr>                   <dbl>
## 1     0      2020 MI               FT              Data Scientist          70000
## 2     1      2020 SE               FT              Machine Learning Scie… 260000
## 3     2      2020 SE               FT              Big Data Engineer       85000
## 4     3      2020 MI               FT              Product Data Analyst    20000
## 5     4      2020 SE               FT              Machine Learning Engi… 150000
## 6     5      2020 EN               FT              Data Analyst            72000
## # ℹ 6 more variables: salary_currency <chr>, salary_in_usd <dbl>,
## #   employee_residence <chr>, remote_ratio <dbl>, company_location <chr>,
## #   company_size <chr>
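
The `...1` column in the output above is the CSV's unnamed first column, which readr renames automatically. A minimal sketch, assuming that column is only a row index and is not needed for the analysis, of reading the file more quietly and dropping it. The temporary CSV is a stand-in built from the first two rows shown above, not the real file, and `jobs` is a name introduced here for illustration.

```r
library(readr)
library(dplyr)

# Stand-in for ds_salaries.csv: the first header cell is empty, as in the real file.
tmp <- tempfile(fileext = ".csv")
writeLines(c(",work_year,salary_in_usd",
             "0,2020,79833",
             "1,2020,260000"), tmp)

# show_col_types = FALSE silences the column specification message;
# any_of() drops the auto-named column only if it is present.
jobs <- read_csv(tmp, show_col_types = FALSE) %>%
  select(-any_of("...1"))

names(jobs)
```

Dropping the column by name with `any_of()` keeps the call safe to rerun on a file that has already been cleaned.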

Explore Data and Make Necessary Transformations:

str(data)
## spc_tbl_ [607 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ...1              : num [1:607] 0 1 2 3 4 5 6 7 8 9 ...
##  $ work_year         : num [1:607] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ experience_level  : chr [1:607] "MI" "SE" "SE" "MI" ...
##  $ employment_type   : chr [1:607] "FT" "FT" "FT" "FT" ...
##  $ job_title         : chr [1:607] "Data Scientist" "Machine Learning Scientist" "Big Data Engineer" "Product Data Analyst" ...
##  $ salary            : num [1:607] 70000 260000 85000 20000 150000 72000 190000 11000000 135000 125000 ...
##  $ salary_currency   : chr [1:607] "EUR" "USD" "GBP" "USD" ...
##  $ salary_in_usd     : num [1:607] 79833 260000 109024 20000 150000 ...
##  $ employee_residence: chr [1:607] "DE" "JP" "GB" "HN" ...
##  $ remote_ratio      : num [1:607] 0 0 50 0 50 100 100 50 100 50 ...
##  $ company_location  : chr [1:607] "DE" "JP" "GB" "HN" ...
##  $ company_size      : chr [1:607] "L" "S" "M" "S" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_double(),
##   ..   work_year = col_double(),
##   ..   experience_level = col_character(),
##   ..   employment_type = col_character(),
##   ..   job_title = col_character(),
##   ..   salary = col_double(),
##   ..   salary_currency = col_character(),
##   ..   salary_in_usd = col_double(),
##   ..   employee_residence = col_character(),
##   ..   remote_ratio = col_double(),
##   ..   company_location = col_character(),
##   ..   company_size = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(data)
##       ...1         work_year    experience_level   employment_type   
##  Min.   :  0.0   Min.   :2020   Length:607         Length:607        
##  1st Qu.:151.5   1st Qu.:2021   Class :character   Class :character  
##  Median :303.0   Median :2022   Mode  :character   Mode  :character  
##  Mean   :303.0   Mean   :2021                                        
##  3rd Qu.:454.5   3rd Qu.:2022                                        
##  Max.   :606.0   Max.   :2022                                        
##   job_title             salary         salary_currency    salary_in_usd   
##  Length:607         Min.   :    4000   Length:607         Min.   :  2859  
##  Class :character   1st Qu.:   70000   Class :character   1st Qu.: 62726  
##  Mode  :character   Median :  115000   Mode  :character   Median :101570  
##                     Mean   :  324000                      Mean   :112298  
##                     3rd Qu.:  165000                      3rd Qu.:150000  
##                     Max.   :30400000                      Max.   :600000  
##  employee_residence  remote_ratio    company_location   company_size      
##  Length:607         Min.   :  0.00   Length:607         Length:607        
##  Class :character   1st Qu.: 50.00   Class :character   Class :character  
##  Mode  :character   Median :100.00   Mode  :character   Mode  :character  
##                     Mean   : 70.92                                        
##                     3rd Qu.:100.00                                        
##                     Max.   :100.00
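
The summary above shows `salary` reaching 30,400,000 while `salary_in_usd` tops out at 600,000, which suggests that `salary` is recorded in each row's own `salary_currency`. A small sketch using the first four rows visible in the output above illustrates why comparisons are safer on `salary_in_usd`. The `usd_per_unit` column and the `rates` object are names introduced here for illustration.

```r
library(dplyr)

# First four rows, copied from the str() output above.
first_rows <- tibble(
  salary          = c(70000, 260000, 85000, 20000),
  salary_currency = c("EUR", "USD", "GBP", "USD"),
  salary_in_usd   = c(79833, 260000, 109024, 20000)
)

# Implied conversion rate per row: exactly 1 for USD rows,
# above 1 for the EUR and GBP rows in this sample.
rates <- first_rows %>%
  mutate(usd_per_unit = salary_in_usd / salary)

rates
```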

Mutate Categorical Variables to be More Descriptive:

data_transformed <- data %>%
  mutate(
    # Expand the experience codes. "MI" keeps the "Manager-Level" label that the
    # plots below rely on; the dataset's documentation describes MI as mid-level.
    experience_level = case_when(
      experience_level == "EN" ~ "Entry-Level",
      experience_level == "MI" ~ "Manager-Level",
      experience_level == "SE" ~ "Senior-Level",
      experience_level == "EX" ~ "Executive-Level",
      TRUE ~ experience_level
    ),
    employment_type = case_when(
      employment_type == "CT" ~ "Contract-Work",
      employment_type == "FT" ~ "Full-Time",
      employment_type == "PT" ~ "Part-Time",
      employment_type == "FL" ~ "Freelance",
      TRUE ~ employment_type
    ),
    company_size = case_when(
      company_size == "L" ~ "Large",
      company_size == "M" ~ "Medium",
      company_size == "S" ~ "Small",
      TRUE ~ company_size
    ),
    # remote_ratio is numeric, so the fallback is converted to character.
    remote_ratio = case_when(
      remote_ratio == 0   ~ "In-Person",
      remote_ratio == 50  ~ "Hybrid",
      remote_ratio == 100 ~ "Remote",
      TRUE ~ as.character(remote_ratio)
    )
  )


head(data_transformed)
## # A tibble: 6 × 12
##    ...1 work_year experience_level employment_type job_title              salary
##   <dbl>     <dbl> <chr>            <chr>           <chr>                   <dbl>
## 1     0      2020 Manager-Level    Full-Time       Data Scientist          70000
## 2     1      2020 Senior-Level     Full-Time       Machine Learning Scie… 260000
## 3     2      2020 Senior-Level     Full-Time       Big Data Engineer       85000
## 4     3      2020 Manager-Level    Full-Time       Product Data Analyst    20000
## 5     4      2020 Senior-Level     Full-Time       Machine Learning Engi… 150000
## 6     5      2020 Entry-Level      Full-Time       Data Analyst            72000
## # ℹ 6 more variables: salary_currency <chr>, salary_in_usd <dbl>,
## #   employee_residence <chr>, remote_ratio <chr>, company_location <chr>,
## #   company_size <chr>
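
The recoding above yields plain character columns, so a plot would sort the categories alphabetically unless the order is set by hand. One possible alternative, sketched here on the first four `company_size` codes from the `str()` output rather than on the full data, is to recode with `factor()` so that the labels and their order travel with the column.

```r
# First four company_size codes, as shown in the str() output above.
codes <- c("L", "S", "M", "S")

# levels lists the codes in the desired order; labels gives the display names.
company_size <- factor(codes,
                       levels = c("S", "M", "L"),
                       labels = c("Small", "Medium", "Large"))

levels(company_size)  # the order a discrete ggplot2 axis would follow
table(company_size)   # counts reported in that same order
```

With a factor in place, an explicit `scale_x_discrete(limits = ...)` call is no longer needed just to fix the order.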

Create Plots to Tell Stories:

Plot 1:

  • With this plot we can see that salary tends to rise with experience level.
  • Different employment types also behave differently: for example, contract-work salaries are much more volatile than full-time salaries.

plot <- ggplot(data_transformed, aes(x=experience_level, y=salary_in_usd, fill=employment_type)) +
  geom_bar(stat='identity', position='dodge') + 
  labs(
    x="Experience Level",
    y="Salary in $USD",
    fill="Employment Type",
    title="The effects of Experience Level on Salary"
  ) + 
  scale_x_discrete(limits = c("Entry-Level", "Senior-Level", "Manager-Level","Executive-Level")) +
  theme_minimal()

ggplotly(plot)
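
Because `geom_bar(stat='identity', position='dodge')` draws one bar per row, bars that share the same x value and fill should be drawn on top of each other, so the tallest one determines the visible height. A sketch of one alternative, assuming a median per group is the quantity of interest: summarise first, then draw one bar per group. The toy rows and the `median_usd` name are invented for illustration.

```r
library(dplyr)
library(ggplot2)

# Invented rows; only the column names match the real data.
toy <- tibble(
  experience_level = c("Entry-Level", "Entry-Level", "Entry-Level",
                       "Senior-Level", "Senior-Level", "Senior-Level"),
  employment_type  = "Full-Time",
  salary_in_usd    = c(50000, 60000, 100000, 120000, 150000, 300000)
)

# One row per (experience_level, employment_type) pair.
medians <- toy %>%
  group_by(experience_level, employment_type) %>%
  summarise(median_usd = median(salary_in_usd), .groups = "drop")

# geom_col() plots the summarised values directly, one bar per group.
p <- ggplot(medians, aes(x = experience_level, y = median_usd, fill = employment_type)) +
  geom_col(position = "dodge")
```

A median is less sensitive to a single very high salary than the maximum, which may matter for small groups such as contract work.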

Plot 2:

  • In-Person paid the highest only at medium-sized companies; Remote had the highest pay at large companies.
  • Pay at small companies grows stepwise across work arrangements (Hybrid, then In-Person, then Remote).

plot <- ggplot(data_transformed, aes(x=company_size, y=salary_in_usd, fill=remote_ratio)) +
          geom_bar(stat='identity', position='dodge') +
          labs(
            x="Company size",
            y="Salary in $USD",
            fill="Remote Ratio",
            title="The effects of Company Size and Remote Ratio on Salary"
          ) + 
          scale_x_discrete(limits = c("Small", "Medium", "Large")) + 
          theme_minimal()

ggplotly(plot)
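
Comparisons like the ones above can also be checked numerically instead of being read off the bars. A sketch, assuming the highest salary in each group is the quantity being compared; the rows are invented and `top_by_size` and `max_usd` are hypothetical names.

```r
library(dplyr)

# Invented rows; only the column names and category labels match the real data.
toy <- tibble(
  company_size  = c("Small", "Small", "Small", "Medium", "Medium", "Medium"),
  remote_ratio  = c("Hybrid", "In-Person", "Remote", "Hybrid", "In-Person", "Remote"),
  salary_in_usd = c(90000, 100000, 120000, 150000, 200000, 180000)
)

# Highest salary per (company_size, remote_ratio) pair,
# then the best-paying arrangement within each company size.
top_by_size <- toy %>%
  group_by(company_size, remote_ratio) %>%
  summarise(max_usd = max(salary_in_usd), .groups = "drop") %>%
  group_by(company_size) %>%
  slice_max(max_usd, n = 1) %>%
  ungroup()

top_by_size
```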

Plot 3:

  • This plot carries a lot of information, but the most interesting takeaway is that the US has the highest-paying jobs, with small companies in Japan coming second.

plot <- ggplot(data_transformed, aes(x=company_location, y=salary_in_usd, fill=company_size)) +
          geom_bar(stat='identity', position='dodge') +
          labs(
            x="Company Location",
            y="Salary in $USD",
            fill="Company Size",
            title="The effects of Company Location and Size on Salary"
          ) + 
          theme_minimal() +
          theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

ggplotly(plot)
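
With one x-axis tick per country, a plot like this gets crowded, and some locations may contribute only a few rows. One way to thin it out, sketched on invented rows, is to keep only locations with at least a chosen number of observations. The threshold `min_rows` and the `n_rows` column are hypothetical names introduced here.

```r
library(dplyr)

# Invented rows; only the column names match the real data.
toy <- tibble(
  company_location = c("US", "US", "US", "GB", "GB", "JP"),
  salary_in_usd    = c(150000, 120000, 200000, 90000, 110000, 260000)
)

min_rows <- 2  # hypothetical cutoff; tune it to the real data

# add_count() attaches the per-location row count without collapsing rows.
frequent <- toy %>%
  add_count(company_location, name = "n_rows") %>%
  filter(n_rows >= min_rows)

unique(frequent$company_location)
```

Filtering this way removes sparse locations from the plot entirely, so any location-level claim should say which cutoff was applied.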

Part 1 Closing Remarks:

  • This has been a look at the data science job market, examining salary as it relates to company size, each company's remote ratio, and the experience level required for each position. Part 2 moves on to more general software engineering visuals.